Loading Data from Python into CAS

There are many ways of loading data into CAS. Some methods simply invoke actions in CAS that load data from files on the server. Other methods involve connections to data sources such as databases in CASLibs. The last method is to load data from the client side using a language such as Python.

There are multiple methods of loading data from Python into CAS. Here we will introduce the different ways of getting your data into CAS from the Python CAS client.

As usual, the first thing we need to do is to import SWAT and create a connection to CAS.


In [1]:
import swat

conn = swat.CAS(host, port, username, password)

Now that we have our connection, let's start loading some data.

Loading Data using Pandas-like Parser Methods

It's possible to load data directly from the connection object that we just created. CAS connection objects support the same data readers that the Pandas package does, including read_csv, read_excel, read_json, read_html, and so on. The CAS connection versions of these methods are merely thin wrappers around the Pandas versions: the data is read into a DataFrame and then uploaded to a CAS table behind the scenes. The result is a Python CASTable object. Let's look at reading a CSV file first. Keep in mind that specifying the file with a URL will download the URL to a local file and then upload it to CAS. This won't matter much with a small data set such as this one, but for larger data sets you may want to use a different method of loading data.


In [2]:
tbl = conn.read_csv('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
tbl


Out[2]:
CASTable('_T_8B0F9FAB_7FFB1018E0C0', caslib='CASUSERHDFS(kesmit)')

In [3]:
tbl.head()


Out[3]:
Selected Rows from Table _T_8B0F9FAB_7FFB1018E0C0
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
2 Audi A41.8T convertible 2dr Sedan Europe Front 35940.0 32506.0 1.8 4.0 170.0 23.0 30.0 3638.0 105.0 180.0
3 Audi A6 3.0 4dr Sedan Europe Front 36640.0 33129.0 3.0 6.0 220.0 20.0 27.0 3561.0 109.0 192.0
4 Audi A6 2.7 Turbo Quattro 4dr Sedan Europe All 42840.0 38840.0 2.7 6.0 250.0 18.0 25.0 3836.0 109.0 192.0

As you can see, getting CSV data into CAS from Python is pretty simple. Of course, if you have a large amount of data in CSV format, you may want to load it from a file on the server, but this method works well for small to medium size data.

To read data from an HTML file, we'll use the read_html method. We also have to add [0] to the end because read_html returns a CASTable object for each table in the HTML file; we just want the first one.


In [4]:
htmltbl = conn.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html')[0]
htmltbl.head()


Out[4]:
Selected Rows from Table _T_8B0FB22B_7FFB1018E0C0
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank August 19, 2016 August 30, 2016 none
1 Hometown National Bank Longview WA 35156 Twin City Bank October 2, 2015 April 13, 2016 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico February 27, 2015 May 13, 2015 none
3 Northern Star Bank Mankato MN 34983 BankVista December 19, 2014 January 6, 2016 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC July 25, 2014 July 28, 2015 none

You'll notice that the 6th and 7th columns contain dates. We can use the parse_dates= parameter of read_html to convert those columns (zero-based indices 5 and 6) into CAS datetime values.


In [5]:
htmltbl = conn.read_html('https://www.fdic.gov/bank/individual/failed/banklist.html', 
                         parse_dates=[5, 6])[0]
htmltbl.head()


Out[5]:
Selected Rows from Table _T_8B0FC63D_7FFB1018E0C0
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank 2016-08-19 2016-08-30 none
1 Hometown National Bank Longview WA 35156 Twin City Bank 2015-10-02 2016-04-13 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico 2015-02-27 2015-05-13 none
3 Northern Star Bank Mankato MN 34983 BankVista 2014-12-19 2016-01-06 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC 2014-07-25 2015-07-28 none
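The strings in the Closing Date and Updated Date columns are now true datetimes. The conversion that parse_dates performs is equivalent to the following stdlib call for a single value (the format string here is an inference from the dates shown above, not anything SWAT-specific):

```python
from datetime import datetime

# parse_dates does this conversion for us; '%B %d, %Y' matches the
# "August 19, 2016" style of dates in the original HTML table
closing = datetime.strptime('August 19, 2016', '%B %d, %Y')
print(closing.date())  # 2016-08-19
```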

Here is the result of the table.columninfo action showing the data types of each column.


In [6]:
htmltbl.columninfo()


Out[6]:
§ ColumnInfo
Column ID Type RawLength FormattedLength Format NFL NFD
0 Bank Name 1 varchar 90 90 0 0
1 City 2 varchar 17 17 0 0
2 ST 3 varchar 2 2 0 0
3 CERT 4 int64 8 12 0 0
4 Acquiring Institution 5 varchar 65 65 0 0
5 Closing Date 6 datetime 8 16 DATETIME 0 0
6 Updated Date 7 datetime 8 16 DATETIME 0 0
7 Loss Share Type 8 varchar 7 7 0 0
8 Agreement Terminated 9 varchar 9 9 0 0
9 Termination Date 10 varchar 43 43 0 0

elapsed 0.00563s · user 0.003s · sys 0.015s · mem 0.766MB

All of the other read_XXX methods in Pandas work from CAS connection objects and support the same options. See the Pandas documentation for the other data reader methods and options.

Let's move on to the next method of loading data: the upload method.

Loading Data Using the upload Method

The next way of loading data is by sending a data file to the server using the upload method of the CAS object. In this style of loading data, the parsing of the file is done on the server side; the file is simply uploaded as binary data and the table.loadtable action is run in the background. The benefit is that the parsing can be done with much faster parsers than are available in Python. Some parsers can even execute in parallel, so the parsing load is split across nodes in the CAS grid. The downside is that there aren't as many parsing options as the Pandas data readers offer, so both methods still have their place. Let's look at code that uploads data files into CAS tables.

We'll start off with CSV files since they are ubiquitous.


In [7]:
out = conn.upload('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out


NOTE: Cloud Analytic Services made the uploaded file available as table CARS in caslib CASUSERHDFS(kesmit).
NOTE: The table CARS has been created in caslib CASUSERHDFS(kesmit) from binary data uploaded to Cloud Analytic Services.
Out[7]:
§ caslib
CASUSERHDFS(kesmit)

§ tableName
CARS

§ casTable
CASTable('CARS', caslib='CASUSERHDFS(kesmit)')

elapsed 0.13s · user 0.253s · sys 0.226s · mem 259MB

When using the upload method on the CAS object, you'll get the result of the underlying table.upload action call. To get a reference to the CASTable object in that result, you just grab the casTable key.


In [8]:
tbl = out['casTable']
tbl.head()


Out[8]:
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura RSX Type S 2dr Sedan Asia Front 23820.0 21761.0 2.0 4.0 200.0 24.0 31.0 2778.0 101.0 172.0
2 Acura TSX 4dr Sedan Asia Front 26990.0 24647.0 2.4 4.0 200.0 22.0 29.0 3230.0 105.0 183.0
3 Acura TL 4dr Sedan Asia Front 33195.0 30299.0 3.2 6.0 270.0 20.0 28.0 3575.0 108.0 186.0
4 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0

The upload method works with any file type that the table.loadtable action supports, but you can also upload a Pandas DataFrame. One thing to note about doing it this way is that the DataFrame data is exported to CSV and then uploaded to CAS, so the resulting table will follow the same rules and limitations as uploading a CSV file.


In [9]:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(50, 4), columns=list('ABCD'))
df.head()


Out[9]:
A B C D
0 0.702470 -0.649162 -0.666066 1.319843
1 1.285588 -0.326041 1.249939 0.110953
2 -0.212528 -0.180530 -0.586989 0.641180
3 -1.227102 2.422193 0.026077 -0.173956
4 -0.724691 0.011051 0.868402 0.512783

In [10]:
dftbl = conn.upload(df)['casTable']
dftbl


NOTE: Cloud Analytic Services made the uploaded file available as table TMP3I2DL8M2 in caslib CASUSERHDFS(kesmit).
NOTE: The table TMP3I2DL8M2 has been created in caslib CASUSERHDFS(kesmit) from binary data uploaded to Cloud Analytic Services.
Out[10]:
CASTable('TMP3I2DL8M2', caslib='CASUSERHDFS(kesmit)')

In [11]:
dftbl.head()


Out[11]:
Selected Rows from Table TMP3I2DL8M2
A B C D
0 0.702470 -0.649162 -0.666066 1.319843
1 1.285588 -0.326041 1.249939 0.110953
2 -0.212528 -0.180530 -0.586989 0.641180
3 -1.227102 2.422193 0.026077 -0.173956
4 -0.724691 0.011051 0.868402 0.512783

In [12]:
dftbl.columninfo()


Out[12]:
§ ColumnInfo
Column ID Type RawLength FormattedLength NFL NFD
0 A 1 double 8 12 0 0
1 B 2 double 8 12 0 0
2 C 3 double 8 12 0 0
3 D 4 double 8 12 0 0

elapsed 0.00537s · user 0.006s · sys 0.012s · mem 0.755MB
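All four columns came back as type double because the CSV round trip forces the server-side parser to re-infer column types from text. A minimal stdlib sketch of that serialization step (SWAT's actual implementation may differ in its details):

```python
import csv
import io

# Header row plus data rows, the way a DataFrame would be flattened to CSV
rows = [('A', 'B'), (0.702470, -0.649162), (1.285588, -0.326041)]

buf = io.StringIO()
csv.writer(buf).writerows(rows)
csv_text = buf.getvalue()

print(csv_text)
```

The server never sees the original numeric dtypes, only this text, which is why the types it reports are whatever its parser infers.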

Using Data Message Handlers

The last method of loading data into CAS from Python is using something called "data message handlers" and the addtable action. The addtable action is different from other actions in that it sets up two-way communication between the server and the client. Once the action is invoked, the server asks for chunks of data until there isn't any more data to upload. Implementing this communication can be tricky, so the SWAT package supplies some predefined data message handlers that can be used directly or subclassed to create your own.

You may not have realized it, but you have already used data message handlers in the first example. The read_XXX methods all use data message handlers to load the data into CAS. You can see all of the supplied data message handlers by looking at the subclasses of swat.cas.datamsghandlers.CASDataMsgHandler.


In [13]:
from swat.cas import datamsghandlers as dmh

print(dmh.CASDataMsgHandler.__subclasses__())
print(dmh.PandasDataFrame.__subclasses__())


[<class 'swat.cas.datamsghandlers.PandasDataFrame'>, <class 'swat.cas.datamsghandlers.DBAPI'>]
[<class 'swat.cas.datamsghandlers.JSON'>, <class 'swat.cas.datamsghandlers.CSV'>, <class 'swat.cas.datamsghandlers.SQLTable'>, <class 'swat.cas.datamsghandlers.Excel'>, <class 'swat.cas.datamsghandlers.FWF'>, <class 'swat.cas.datamsghandlers.SAS7BDAT'>, <class 'swat.cas.datamsghandlers.HTML'>, <class 'swat.cas.datamsghandlers.SQLQuery'>, <class 'swat.cas.datamsghandlers.Clipboard'>, <class 'swat.cas.datamsghandlers.Text'>]

You may notice that most of the data message handlers are just subclasses of the PandasDataFrame message handler. This is because they all convert their data into a Pandas DataFrame before uploading the data. This also means that the data must fit into memory to use them. There is one data message handler that doesn't work that way: swat.cas.datamsghandlers.DBAPI. The DBAPI message handler uses a Python DB-API 2.0 compliant database connection to retrieve data from a database and loads that into CAS. It only loads as much data as is needed for a single chunk. We'll just be looking at the CSV data message handler here. The rest follow the same pattern of usage.
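The chunked retrieval that the DBAPI handler relies on is just standard DB-API cursor behavior. Here is a self-contained sketch using sqlite3 (the table and data are made up, and this shows only the client-side chunking, not the CAS upload itself):

```python
import sqlite3

db = sqlite3.connect(':memory:')
db.execute('CREATE TABLE cars (make TEXT, msrp REAL)')
db.executemany('INSERT INTO cars VALUES (?, ?)',
               [('Acura', 36945.0), ('Audi', 35940.0), ('BMW', 41000.0)])

cursor = db.execute('SELECT make, msrp FROM cars')

# Fetch rows in fixed-size chunks, the way a data message handler
# streams data to the server one chunk at a time
chunks = []
while True:
    rows = cursor.fetchmany(2)  # chunk size of 2 for illustration
    if not rows:
        break
    chunks.append(rows)

print(len(chunks))  # 2 chunks: one of 2 rows, one of 1 row
```

Because only one chunk is held in memory at a time, this style of loading scales to tables that are far larger than client memory.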

To use the CSV data message handler, you just pass it the path (or URL) of a CSV file. This creates an instance of the data message handler that can be used to generate the arguments to the addtable action and that also handles the requests from the server.


In [14]:
cvs_dmh = dmh.CSV('https://raw.githubusercontent.com/sassoftware/sas-viya-programming/master/data/cars.csv')
out = conn.addtable(table='cars', replace=True, **cvs_dmh.args.addtable)

The action call arguments may look a little odd if you haven't used Python much. The args.addtable property of cvs_dmh contains a dictionary of parameters for the addtable action. The double-asterisk before it tells Python to apply the dictionary as keyword arguments to the method call. The arguments generated by the data message handler are shown below.
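The double-asterisk syntax is plain Python keyword-argument unpacking, not anything SWAT-specific. A minimal illustration with a made-up function:

```python
def add_demo(table, replace=False, reclen=0):
    # Stand-in for an action call; it just echoes its arguments
    return (table, replace, reclen)

params = {'replace': True, 'reclen': 160}

# These two calls are equivalent
result = add_demo('cars', **params)
print(result)  # ('cars', True, 160)
```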


In [15]:
cvs_dmh.args.addtable


Out[15]:
{'datamsghandler': <swat.cas.datamsghandlers.CSV at 0x7f0173c9cd30>,
 'reclen': 160,
 'vars': [{'length': 16,
   'name': 'Make',
   'offset': 0,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Model',
   'offset': 16,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Type',
   'offset': 32,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'Origin',
   'offset': 48,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 16,
   'name': 'DriveTrain',
   'offset': 64,
   'rtype': 'CHAR',
   'type': 'VARCHAR'},
  {'length': 8,
   'name': 'MSRP',
   'offset': 80,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Invoice',
   'offset': 88,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'EngineSize',
   'offset': 96,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Cylinders',
   'offset': 104,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Horsepower',
   'offset': 112,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'MPG_City',
   'offset': 120,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'MPG_Highway',
   'offset': 128,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Weight',
   'offset': 136,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Wheelbase',
   'offset': 144,
   'rtype': 'NUMERIC',
   'type': 'SAS'},
  {'length': 8,
   'name': 'Length',
   'offset': 152,
   'rtype': 'NUMERIC',
   'type': 'SAS'}]}

The output of the addtable action includes a casTable key just like the upload method.


In [16]:
tbl = out['casTable']
tbl.head()


Out[16]:
Selected Rows from Table CARS
Make Model Type Origin DriveTrain MSRP Invoice EngineSize Cylinders Horsepower MPG_City MPG_Highway Weight Wheelbase Length
0 Acura MDX SUV Asia All 36945.0 33337.0 3.5 6.0 265.0 17.0 23.0 4451.0 106.0 189.0
1 Acura 3.5 RL 4dr Sedan Asia Front 43755.0 39014.0 3.5 6.0 225.0 18.0 24.0 3880.0 115.0 197.0
2 Audi A41.8T convertible 2dr Sedan Europe Front 35940.0 32506.0 1.8 4.0 170.0 23.0 30.0 3638.0 105.0 180.0
3 Audi A6 3.0 4dr Sedan Europe Front 36640.0 33129.0 3.0 6.0 220.0 20.0 27.0 3561.0 109.0 192.0
4 Audi A6 2.7 Turbo Quattro 4dr Sedan Europe All 42840.0 38840.0 2.7 6.0 250.0 18.0 25.0 3836.0 109.0 192.0

Just as with the read_XXX methods on the connection, the PandasDataFrame-based data message handlers take the same arguments as the corresponding read_XXX methods to customize their behavior. Using Pandas' HTML reader as an example again, we can upload data containing dates. Note that in this case, we specify the index of the HTML table that we want in the index= parameter.


In [17]:
html_dmh = dmh.HTML('https://www.fdic.gov/bank/individual/failed/banklist.html', index=0, parse_dates=[5, 6])
out = conn.addtable(table='banklist', replace=True, **html_dmh.args.addtable)
tbl = out.casTable
tbl.head()


Out[17]:
Selected Rows from Table BANKLIST
Bank Name City ST CERT Acquiring Institution Closing Date Updated Date Loss Share Type Agreement Terminated Termination Date
0 The Woodbury Banking Company Woodbury GA 11297 United Bank 2016-08-19 2016-08-30 none
1 Hometown National Bank Longview WA 35156 Twin City Bank 2015-10-02 2016-04-13 none
2 Doral BankEn Espanol San Juan PR 32102 Banco Popular de Puerto Rico 2015-02-27 2015-05-13 none
3 Northern Star Bank Mankato MN 34983 BankVista 2014-12-19 2016-01-06 none
4 GreenChoice Bank, fsb Chicago IL 28462 Providence Bank, LLC 2014-07-25 2015-07-28 none

Conclusion

In this article, we have demonstrated three ways of getting data from Python into CAS. The first was using the read_XXX methods that emulate Pandas' data readers. The second was using the upload method of the CAS connection to upload a file and have the server parse it. The third was using data message handlers to add data with the addtable action. Each method has its strengths and weaknesses. The upload method is generally the fastest, but has the most limited parsing options. The read_XXX methods are quick and convenient and give you the full power of the Pandas data readers. And while we didn't go deep into data message handlers here, they offer the most flexibility, since you can subclass them to write custom data loaders for any data source.


In [18]:
conn.close()
